Take-home Exercise 1

A new article created using the Distill format.

Huan Li https://linkedin.com/in/huan-li-ab7498124/
2022-04-25

1. Overview

This take-home exercise aims to sharpen the skill of building data visualization programmatically using ggplot2 by exploring the insights of participants dataset.The data visualizations included in this exercise are: - Create a pareto chart showing the distribution of age group. - Create split violin plots showing the distribution of joviality by educationLevel and haveKids viarables.

2. Data Preparation

2.1 Installing and loading the required libraries

The packages required are tidyverse (included relevant packages for data analyses such as ggplot2, readr and dplyr), ggrepel,patchwork and knitr.

The code chunk below is used to install and load the required packages onto RStudio.

packages = c('tidyverse')

for (p in packages){
  if(!require(p, character.only = T)){
    install.package(p)
  }
  library(p, character.only = T)
}
library(patchwork)

2.2 Importing the Dataset

The code chunk below import Participants.csv from data folder into R by using read_csv() of readr package and save it as an tibble data frame called participants.

participants <- read_csv("data/Participants.csv")
glimpse(participants)
Rows: 1,011
Columns: 7
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~

The table above shows that there are 7 columns 1011 rows in the dataset.

2.3 Explotatry Data Analysis of features

2.3.1 Dstribution of univariable

Before dig deep into the data, we use ggplot to visualize the distribution of all the columns.

p_householdsize <- ggplot(data = participants,
       aes(x = householdSize))+
  geom_bar(color="grey25", 
           fill="grey90") +
  ggtitle("Distribution of Household Size")

p_havekids <- ggplot(data=participants, 
             aes(x = haveKids)) +
  geom_bar(color="grey25", 
           fill="grey90") + 
  ggtitle("Distribution of Have Kids")

p_age <- ggplot(data=participants, 
             aes(x = age)) +
  geom_histogram(boundary = 100,
                 color="grey25", 
                 fill="grey90") +
  coord_cartesian(xlim=c(16, 61)) +
  ggtitle("Distribution of Age")

p_edu <- ggplot(data=participants, 
             aes(x = educationLevel)) +
  geom_bar(boundary = 100,
                 color="grey25", 
                 fill="grey90") +
  ggtitle("Distribution of Education Level")

p_interest <- ggplot(data=participants, 
             aes(x = interestGroup)) +
  geom_bar(boundary = 100,
                 color="grey25", 
                 fill="grey90") +
  ggtitle("Distribution of Interest Group")

p_joviality<- ggplot(data=participants, 
             aes(x = joviality)) +
  geom_density() +
  ggtitle("Distribution of Joviality")

(p_householdsize/p_havekids/p_interest)|(p_age/p_edu/p_joviality)

Figures above show that the data of householdSize, interestGroup and age are equally distributed. For other viarables, we need to explore it further.

2.3.2 Linkage between variables

Let’s have a look at the relationship between variables.

ggplot(data = participants,
       aes(x = educationLevel, fill = haveKids))+
  geom_bar()

Figure above shows that higher the education leval, lower possibility to have kids.

ggplot(data = participants,
       aes(x = age, fill = haveKids))+
  geom_bar()

3. Proposed Sketch

We wanted to further explore the distribution of age, education, and Joviality, so we decided to use two more complex charts, that’s pareto chart of age group and violin plots of joviality by educationLeval and haveKids.

4. Pareto Chart of age

Pareto charts show the ordered frequency counts of data. They show the ordered frequency counts of values for the different levels of a categorical or nominal variable. These charts are often used to identify areas to focus on first in process improvement, as supported by the Pareto Principle (80/20 Rule).

4.1 Data Wrangling

4.1.1 Data binning on age variable

The following code shows how to perform data binning on the age variable using the cut() and mutate() function with specific break marks:

part_age<- participants %>%
  mutate(ageGroup = cut(age, breaks = c(17,25,35,45,55,60)))
part_age
# A tibble: 1,011 x 8
   participantId householdSize haveKids   age educationLevel     
           <dbl>         <dbl> <lgl>    <dbl> <chr>              
 1             0             3 TRUE        36 HighSchoolOrCollege
 2             1             3 TRUE        25 HighSchoolOrCollege
 3             2             3 TRUE        35 HighSchoolOrCollege
 4             3             3 TRUE        21 HighSchoolOrCollege
 5             4             3 TRUE        43 Bachelors          
 6             5             3 TRUE        32 HighSchoolOrCollege
 7             6             3 TRUE        26 HighSchoolOrCollege
 8             7             3 TRUE        27 Bachelors          
 9             8             3 TRUE        20 Bachelors          
10             9             3 TRUE        35 Bachelors          
# ... with 1,001 more rows, and 3 more variables:
#   interestGroup <chr>, joviality <dbl>, ageGroup <fct>

4.1.2 Compute the frequency by age

To achieve the frequency count of age, group_by() of dplyr package is used to group data by age. Then, summarise() of dplyr is used to count (i.e. n()) the number of each age group.

dit_age<- part_age %>%
  group_by(`ageGroup`) %>%
  summarise('Counts'=n()) %>%
  ungroup()
dit_age
# A tibble: 5 x 2
  ageGroup Counts
  <fct>     <int>
1 (17,25]     184
2 (25,35]     243
3 (35,45]     226
4 (45,55]     244
5 (55,60]     114

4.1.3 Sorting Data

By default, the values of the tibble data frame is sorted according to the values of the first column. We will need to sort the age group by descending order of counts in the age group. To accomplish this task, the arrange() of dplyr package is used as shown in the code chunk below.

freq_sorted <- dit_age %>%
  arrange(desc(Counts))
head(freq_sorted)
# A tibble: 5 x 2
  ageGroup Counts
  <fct>     <int>
1 (45,55]     244
2 (25,35]     243
3 (35,45]     226
4 (17,25]     184
5 (55,60]     114

4.1.4 Compute the Cumulative Frequency

Lastly, we can compute the cumulative frequency of age group. This task will be performed by using mutate() of dplyr package and cumsum() of Base R.

The newly computed cumulative frequency values will be stored in a new field called cumfreq.

freq_cum <- freq_sorted %>%
  mutate(cumfreq = cumsum(Counts)) %>%
  mutate(cum = cumsum(Counts)/sum(Counts))
head(freq_cum)
# A tibble: 5 x 4
  ageGroup Counts cumfreq   cum
  <fct>     <int>   <int> <dbl>
1 (45,55]     244     244 0.241
2 (25,35]     243     487 0.482
3 (35,45]     226     713 0.705
4 (17,25]     184     897 0.887
5 (55,60]     114    1011 1    

4.2 Plot Pareto Chart

4.2.1 Static Pareto Chart

A pareto chart was plotted using ggplot2 as follows: geom_col() instead of geom_bar() was used to create the bar chart as we do not need to modify the data, and want the height of the bar to represent the actual counts of people in each age group. geom_line() and geom_point() was used for the line and scatter plot for the cumulative frequency of return counts. The scatter plot helps reader to identify the corresponding cumulative frequency for people in each age group in a static chart. scale_y_continuous() was used to adjust the interval between the grid lines and add a secondary y axes for the cumulative percentage of people counts for each age group. The secondary y axes is just based on a one-to-one transformation of the primary axes, hence both bar and line charts are still plotted with reference to the primary axes. theme() was lastly used to adjust the background color and grid lines of the plot to improve visibility to readers.

pa <- ggplot(data = freq_cum, 
       aes(x = reorder(`ageGroup`,-`Counts`))) +
  geom_col(aes(y=`Counts`), fill = 'lightblue', width= 0.8) +
  geom_point(aes(y=`cumfreq`), color = 'grey20', size = 0.8) +
  geom_line(aes(y =`cumfreq`, group = 1), colour = 'grey20', size = 0.4) +
  labs(x = "ageGroup", title = "Pareto Chart of Age Group") +
  scale_y_continuous(
    name = 'Absolute Frequency', breaks = seq(0, 3500, 300), minor_breaks = seq(0, 3500, 100),
    sec.axis = sec_axis(~.*1/sum(freq_cum$Counts), name = 'Cumulative Frequency', breaks = seq(0, 1, by = 0.1), labels = scales::percent)
  )+
    geom_hline(yintercept=0.8*sum(freq_cum$Counts), linetype="dashed", color = "grey50") +
  geom_text(aes(3, 0.8*sum(freq_cum$Counts), label = "80.0%", vjust = -0.5), size = 2) +
  theme(panel.background = element_rect(fill = 'white', colour = 'grey60', size = 0.5, linetype = 'solid'),
        panel.grid.major = element_line(size = 0.3, linetype = 'solid', colour = 'grey85'),
        panel.grid.minor = element_line(size = 0.2, linetype = 'solid', colour = 'grey90'),
        text = element_text(size = 5.5),
        axis.ticks.x = element_blank()) 
pa
pa

4.2.2 Interactive Pareto Chart

From the previous chart plotted using ggplot2, the secondary y axes was based on a one-to-one transformation of the primary y axes. The bar and line charts were plotted with reference to the primary y axes, hence the secondary y axes will not be transferred to the interactive chart using ggplotly() of plotly library.

Therefore, an interactive pareto chart was plotted using plot_ly() from plotly. A scatter plot was not included in this case as one can simply hover across the line to view the corresponding cumulative percentage for each age group. Similarly, hovering on the bars will reveal the absolute frequency of people counts for each age group.

Call the library first:

plot_ly(freq_cum, x = ~reorder(`ageGroup`,-`Counts`), y = ~`Counts`, type = "bar", name = "Counts") %>%
  add_trace(x = ~reorder(`ageGroup`,-`Counts`), y = ~`cum`*100,type = "scatter", mode = "lines", yaxis = "y2", name = "Cum. %") %>%
  plotly::layout(title = "Pareto chart of ageGroup",
         xaxis = list(title = "ageGroup"),
         yaxis = list(title = "Counts (Absolute Frequency)", showgrid = F),
         yaxis2 = list(overlaying = "y", side = "right", title = "Cumulative Percentage (%)", range = list(0, 100)),
         legend = list(orientation="h", yanchor="bottom",y=0.9,xanchor="top",x=0.2)) 

4.3 Insights from Visualization

From the pareto chart, we can tell that in the dataset, there are less people for people who are older than 55 years old and younger than 25 years old.The others take up g=for about 80%, which is

5. Split violin plots

#5.1 Package Installation A new package [introdataviz] need to be installed to visualize the split violin plots.

devtools::install_github("psyteachr/introdataviz")

5.2 Creat split violin plots

5.2.1 violin plot

Code chunk below is used to create the split violin plots.

ggplot(part_age, aes(x = educationLevel, y = joviality, fill = haveKids)) +
  introdataviz::geom_split_violin()

5.2.2 Add boxplot on it

In order to more clearly show the distribution difference of Joviality among people with different education levels and people with or without children, box plot needs to be superimposed on the violin chart.

ggplot(part_age, aes(x = educationLevel, y = joviality, fill = haveKids)) +
  introdataviz::geom_split_violin() +
  geom_boxplot()

5.2.3 Change the Format

ggplot(part_age, aes(x = educationLevel, y = joviality, fill = haveKids)) +
  introdataviz::geom_split_violin(alpha = .4, trim = FALSE) +
  geom_boxplot(width = .2, alpha = .6, fatten = NULL, show.legend = FALSE) +
  stat_summary(fun.data = "mean_se", geom = "pointrange", show.legend = F, 
               position = position_dodge(.175)) +
  scale_y_continuous(breaks = seq(0, 1, 0.2), 
                     limits = c(0, 1)) +
  scale_fill_brewer(palette = "Dark2", name = "Have Kids")

5.3 Insights from Visualization

From the plot, we can tell that: - For those who does not have kids, higher the education level, higher the joviality level.But for those who has kids, it’s not the case-group from the low education level have a higher joviality level. - The joviality differs the most for people from the lowest education group when comes to whether have kids affect the level of happiness.The median of those who comes from low education group and does not have kids is the lowest, the median of those who comes from low education group but have kids is the hightes. - The medians of groups except for low education are in a similar level no matter whether they have kids or not.